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Abstract: Estimating the parameters of general state-space models is a topic of importance 
for many scientific and engineering disciplines. In this paper we present an online parameter es¬ 
timation algorithm obtained by casting our recently proposed particle-based, rapid incremental 
smoother (PaRIS) into the framework of online expectation-maximization (EM) for state-space 
models proposed by Cappe (2011). Previous such particle-based implementations of online EM 
suffer typically from either the well-known degeneracy of the genealogical particle paths or 
a quadratic complexity in the number of particles. However, by using the computationally 
efficient and numerically stable PaRIS algorithm for estimating smoothed expectations of time- 
averaged sufficient statistics of the model we obtain a fast algorithm with very limited memory 
requirements and a computational complexity that grows only linearly with the number of 
particles. The efficiency of the algorithm is illustrated in a simulation study. 
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1. INTRODUCTION 

This paper deals with the problem of online parameter 
estimation in general state-space models (SSM) using se¬ 
quential Monte Carlo (SMC) methods and an expectation- 
maximization (EM) algorithm. SSMs, which are also re¬ 
ferred to as general hidden Markov models, are currently 
applied within a wide range of engineering and scientific 
disciplines; see e.g. Cappe et al. (2005, Chapter 1) and 
the references therein. In its most basic form (as proposed 
by Dempster et ah, 1977), the EM algorithm, which is 
widely used for estimating model parameters in SSMs, 
is an offline algorithm in the sense that every recursive 
parameter update requires the processing of a given batch 
of data. When the batch is very large or when data is 
received only gradually in a stream, this approach may be 
slow and even impractical. In such a case, using an online 
EM-algorithm is attractive since it generates a sequence 
of parameter converging towards the true parameter by 
processing recursively the data in a single sweep. 

The algorithm we propose is a hybrid of the online EM 
algorithm proposed by Cappe (2011) and the efficient 
particle-based online smoothing algorithm suggested re¬ 
cently by Olsson and Westerborn (2014). On the contrary 
to previous algorithms of the same type (see e.g. Del Moral 
et ah, 2010), which have a quadratic computational com¬ 
plexity in the number of particles, our algorithm stays nu¬ 
merically stable for a complexity that grows only linearly 
with the number of particles. 


* This work is supported by the Swedish Research Council, Grant 
2011 - 5577 . 


2. PRELIMINARIES 

We will always assume that all distributions admit den¬ 
sities with respect to suitable dominating measures and 
we will also assume that all functions are bounded and 
measurable. 

In SSMs, an unobservable Markov chain {AtjtgN (the 
state process), taking values in some space X and having 
transition density and initial distribution qg and y, respec¬ 
tively, is only partially observed through an observation 
process {TtjteN taking values in Y. Conditionally on state 
process, the observations are assumed to be independent 
with conditional distribution gg of each Yt depending on 
the corresponding state Xt only; i.e., 

P,(yt e A I Ao:t) = f geiXt,y)dy, 

Ja 

for all (measurable) A C X. The densities above are in¬ 
dexed by a parameter vector 9 that determines completely 
the dynamics of the model. 

Given a sequence of observations j/o:i) the likelihood of 9 
is given by 

LeiVo-.r) = 

T 

geixo, yo) x{xo) F yt)q0ixt-i,xt) dxo,T. 

i=l 

Unless we are operating on a linear Gaussian model or 
a model with a finite state space X, this likelihood is 
intractable and needs to be approximated. If we wish to 
infer a subset of the hidden states given the observations, 
the optimal choice of distribution is the conditional dis¬ 
tribution (j)s:s'\T ;0 of Xs-.s', (s < s') given the observations 
Yq,t. This is given by 



4>a-.s'\T-s{Xs-.s') = Lg {iJo-.t) 


g0{xo,yo)x{xo) 


T 

X n 90 {xt, yt)qe{xt-i,Xt) dxors-i da:s/+i:T- 

t=i 


We refer to (jiT-fi = 4>t\t-.0 as the filter distribution and to 
4’o:T\t-0 as the joint smoothing distribution. 


The EM algorithm computes the maximum-likelihood 
estimator using some initial guess Bq of the same. It 
proceeds recursively in two steps. First, in the E-step, 
given some parameter estimate 9i, it computes the xy 
intermediate quantity 


Q{9,9j) = Eg. 


.*=0 


Y,^oggg{X^,Y^)\Y^,T 

T-1 

^ogqg{Xt,Xt+i) I Yq,t 


■ Efl 


t=o 


where E^i. denotes the expectation under the dynamics de¬ 
termined by the parameter 9i, and in the second step, the 
M-step, it updates the parameter fit according to 9i+i = 
a,Tgmax.Q{9,9i). Under weak assumptions, repeating this 
procedure produces a sequence of parameter estimates 
that converges to a stationary point of the likelihood. If 
the joint distribution of the SSM belongs to an exponential 
family, then the intermediate quantity may be written as 

Q{9,9i) = {4>{9), 4>o:T\T-,0i {st)) — c(9), 


where (•,•) denotes scalar product, (j){9) and c{9) are 
known functions, and st is a (vector-valued) sufficient 
statistic of additive form. Given the smoothed sufficient 
statistic (f)Q:T\T-, 0 i{sT), the M-step of the algorithm can 
be typically expressed in a closed form via some update 
function 0^+1 = E{(l)Q.^T\T-, 0 i{sT)/T). Hence, being able to 
compute such smoothed expectations is in general crucial 
when casting SSMs into the framework of EM. 


As mentioned, the sufficient statistic st{xo-.t) is of additive 
form, i.e. 

T-l 

5T(a:^0:T) = ^ 

where all functions are possibly vector-valued. We denote 
vector components using superscripts, i.e., St{xt:t+i) = 
{sl{xt:t+l), ■ ■ ■ ,si{Xf.t+l))- 
Example 1. Consider the linear Gaussian SSM 
Xt+i = aXt-\-ayVt, , ^ 

Yt = Xt+auUt, 

where {VjjtgN and {UtjtgN are mutually independent 
sequences of independent standard Gaussian variables. 
The parameters of this model are 9 = {a,ay,<j"lj) and 
the model belongs to an exponential family with sufficient 
statistics given by 

s\{xt,t+l) = xl, sl{Xf.t+l) = XtXt+l, 
sUxf.tg-i) = a;?+i, sf{xf.t-gi) = {ytg-i - xt+if. 

The M-step update function is then given by 

A{zi,Z2,Z3,Z4) = ( —,Z 3 - —, 2:4 

zi 


When computing smoothed expectations of additive form 
it is advantageous to use the backward decomposition of the 


joint smoothing distribution. This decomposition comes 
from the fact that the state process is, conditionally on 
the observations, still Markov in the forward as well as 
backward directions. The backward kernel he., the 

distribution of Xt conditionally on and Yq-x, is given 

by 

^ / X (l)f,0ixt)q0ixt,xt+i) 

q <i>f,e',0\Xt+l)Xt) — n 'i j ~ ■ 

j (j)t-0[x^t)q0[Xt,Xt+l)'XXt 

Using the backward kernel, the joint smoothing distribu¬ 
tion may be written as 

T-l 

4’O-.T\T-0{xo-.t) = ^(t>f_g-0{xt+l,Xt)(j)T-,0{xT)- 

t=0 

The backward distribution can also be used effectively 
when estimating expectations of additive form; indeed, if 
Tt{xt) =t~^f St(xo:t) n^Zo^<i>r,e;0(x£-hi, X£) dxo:t-i, then 
fo:t- 0 it~^St) = 4>ffi{Tt) and 

Tt{xt) = J{jtSt-l{xt-l:t) + (1 - Jt)Tt-l{xt-l)} 

^^<Pt-i;g-,0ixt,Xt-l)dXt-l, (I) 
where 74 = t~^. The recursion is initialized by tq = 0. 
The fundamental idea of the online EM algorithm is 
to update sequentially the smoothed sufficient statistics 
using (I) and plugging these quantities into the updating 
procedure A. In practice one uses, rather than ■jt = 
t~^, some step size satisfying the regular stochastic 
approximation requirements Xt = co and YZtLo Xt < 

00 . Nevertheless, since the backward kernel involves the 
filter distributions, the recursion ( 1 ) cannot be computed 
in a closed form, and we are hence forced to approximate 
the same. We will use particle methods for this purpose. 

3. PARTIGLE METHODS 


A particle filter updates sequentially, using importance 
sampling and resampling techniques, a set 
of particles and associated weights targeting a sequence 
of distributions. In our case we will use the particle filter 
to target the filter distribution flow {(/)qe}teN in tbe sense 
that 


N i 

0 qe(/) = X] asN^oo, 




where O* = Notice that the parameters and 

weights depend on the parameters even though this is 
implicit in the notation. In the bootstrap particle filter, the 
sample is updated by, first, resampling the 

particles multinomially according to weights proportional 
to the particle weights, second, propagate the resampled 
particles forward in time using the dynamics of the state 
process space, and, third, assigning the particles weights 
proportional to the local likelihood of the new observa¬ 
tion given the particles. The update, which is detailed 
in Algorithm I, will in the following be expressed as 

In the previous scheme, PrdwflUi) refers to the discrete 
probability distribution induced by the weights 
As a by-product, the historical trajectories of the particle 
filter provide jointly an estimate of the joint-smoothing 
distribution. These trajectories are constructed by linking 



Algorithm 1 Bootstrap particle filter 

Require: Parameters 9 and a weighted particle sample 

1; for f = 1 —> TV do 
2 : 

3; Draw ~ , xt) dxt, 

4: Set <- 

5: end for 

6: return {(etVi,u;j+i)}iIi 


up the particles with respect to ancestors. However, this 
method suffers from a well-known degeneracy phenomenon 
in the sense that the repeated resampling operations col¬ 
lapse the particle lineages as time increases. Consequently, 
the weighted empirical measures associated with the paths 
degenerate in the long run; see Olsson et al. (2008) for some 
discussion. 


A way to combat the degeneration is to use instead 
the backward decomposition presented above. Using the 
output of the bootstrap particle filter we obtain the 
particle approximation 




^lqs(Ct,xt+i) 

E£=i^he(^f,xt+i) 


of the backward kernel. Plugging this into the backward 
decomposition we arrive at the forward-filtering backward¬ 
smoothing (FFBSm) algorithm, where (/>o:T|T;e(/) is ap¬ 
proximated by 


N N T-1 

4’0:T\T-eif) = ^ ^ n 


io — 1 iT — l ^—0 




For a general objective function /, this occupation measure 
is impractical since the cardinality of its support grows 
geometrically fast with T. In the case where / is of additive 
form the computational complexity is quadratic since the 
computation of the normalizing constants is required for 
each particle and each time step. Consequently, FFBSm is 
a computationally intensive approach. 


In the case where the objective function is of additive form 
we can use the forward decomposition presented earlier to 
obtain an online algorithm; see Del Moral et al. (2010). We 
denote the auxiliary functions initialized by setting 

Tg = 0 for all i = {!,..., A}. When a new observation 
is available, an update of the particle sample is followed 
by a recursive update of the auxiliary functions 
according to 


rt+i = 


N 

E 




After this, the FFBSm estimate is formed as 


( 2 ) 


N i 
2=1 




Appealingly, this approach allows for online processing 
of the data and requires only the current particles and 
auxiliary statistics to be stored. Still, the computational 
complexity of the algorithm grows quadratically with the 


number of particles, since a sum of N terms needs to be 
computed for each particle at each time step. 

To speed up the algorithm, Olsson and Westerborn (2014) 
propose, in the particle-based rapid incremental smoother 
(PaRIS) algorithm, (2) to be replaced by a Monte Carlo 
estimate. Given {T(_i}(^i we update the auxiliary statistic 

by drawing indices according to the backward 

dynamics governed by the particle filter, i.e., drawing 

After this, each auxiliary statistic is updated through the 
Monte Carlo estimate 

N 

and the estimate of (f>o:t\t; 0 {st) is obtained as /^t- 

Again, the procedure is initialized by setting Tq = 0 for 
i = {!,...,A}. In this naive form the computational 
complexity is still quadratic; however, this approach can 
be furnished with an accept-reject trick found by Douc 
et al. ( 2011 a), which reduces drastically the computational 
work. The accept-reject procedure can be applied when the 
transition density for the hidden chain is bounded, i.e., 
there exists some finite constant q'g such that qg{x,x') < 
qg for all {x,x') G X^. This is a very weak assump¬ 
tion which is generally satisfied. In the scheme, an index 
proposal J* drawn from Pi'({wJ_i}(^i) is accepted with 
probability qg{^f_i>^t)/q4 procedure is repeated 

until acceptance. Under the strong mixing assumption (see 
below for details) it can be shown that the expected 
number of proposals is bounded] see Douc et al. (2011a); 
Olsson and Westerborn (2014). Consequently, the overall 
computational complexity of PaRIS is linear. In PaRIS A 
is a design parameter that should be at least 2 to achieve 
a stable algorithm; see Olsson and Westerborn (2014) for 
details. 

Olsson and Westerborn (2014) present a thorough the¬ 
oretical study of the PaRIS algorithm. Under the weak 
assumption that the emission density is bounded and 
positive, it is established that the following central limit 
theorem holds true for all fixed t and A: as A —>■ oo, 

N 

i=l 

where Z has standard Gaussian distribution and 

s=0 £=0 

with of being the asymptotic variance of the FFBSm 
algorithm; see Olsson and Westerborn (2014, Corollary 5) 
for details. In the same work, the numerical stability of 
PaRIS is studied under the strong mixing assumptions 

(i) 0 < < qg{x,x) < q^ for all {x,x) G X^, 

(ii) || 5 e(-,Pi)||oo < Sg and 0 < < / gg{x,Yt)qg{x,x)dx 

for all X, t, 

which are standard in the literature and point to appli¬ 
cations where the state space X is compact. Moreover, it 
is assumed that there exists a constant |s|oo such that for 
all t, osc(st) < |s|oo < oo. Under these assumptions it is 



fit 


{rl - (/>0:i|M(st)) ^ crt{st)Z, 


Ui.j) r(i,3) . 

r/di ,Ct) 


(3) 




shown that there exist constants c and c such that, for 
N >2, 

Iimsup;|;cr 2 (s 0 < |s|^ fc+ 

i->oo t TV — 1 

This bound implies that the Monte Carlo error added by 
PaRIS at each time step is uniformly bounded in time, 
which shows that the algorithm stays numerically stable in 
the long run. Moreover, as the previous bound is inversely 
proportional to N, we may draw the conclusion that N 
should be kept at a moderate value, say, less than 10 . 
Consequently, the asymptotic variance is of order 0{t), 
which is the best possible for a Monte Carlo approximation 
on the path space; see Olsson and Westerborn (2014, 
Theorem 8 ) for details. See also Del Moral and Guionnet 
(2001) and Douc et al. (2011a) for similar stability results 
for the standard particle filter and the FFBSm algorithm, 
respectively. 

Now, we may cast the PaRIS algorithm into the framework 
of the online EM algorithm of Cappe (2011) by simply 
replacing (3) by the the updating formula 

N 

1=1 

where again the sequence {qtjtgN should satisfy the usual 
stochastic approximation requirements. A standard choice 
is to set jt = t~°‘ for .5 < a < 1. The algorithm is 
summarized in Algorithm 2. 




r(i)j) 


(l-7i)A-i +7tst-ite-i ,Ct) 




t-1 


7tSt_i(Xt_i,Xt) + (1 - 7t)y^ 

X ( n leSi.iiXi.uXe) \ Po:* 


Vi=^+l 


where the subscript Oo-.t-i indicates that the expectation 
is taken under the model dynamics governed by Oi at 
time i + \. Thus, if the sequence {0t}tgN converges, we 
may expect, since the factor — 7 i) reduces the 

influence of early parameter estimates, that a fixed point of 
the online EM procedure coincides with a stationary point 
of the asymptotic contrast function, i.e., by identifiability, 
the true parameter value (see Douc et al. (2011b)). At 
present, a convergence result for the online EM algorithm 
for SSMs is lacking (a theoretical discussion is however 
given by Cappe, 2011), but for independent observations 
(e.g., mixture models) convergence is shown in Cappe and 
Moulines (2009). 


Another algorithm that is worth mentioning here is the 
block online EM algorithm (Le Corff and Eort, 2013), 
where the parameter is only updated at fixed and increas¬ 
ingly separated time points. This algorithm can be shown 
to converge; however, simulations indicate that this block 
processing approach is less advantageous than updating 
the parameter at every time step. An overview of param¬ 
eter estimation methods is given by Kantas et al. (2014). 


4. SIMULATIONS 


Algorithm 2 PaRIS-based online EM 
1; set Tq = 0 for z = {1,..., N}; 

2: under some 9o, generate {(Co)‘X’o)}fc=i targeting <j)o-gg; 
3; for t = 1 —> T do 

4: run ^ PF({(C*_i,wj_i)}iIi;6»t_i); 

5; for z = 1 —> TV do 
6; for j = 1 —> TV do 

7: draw 4*’^) ~ Pr{{ooUq0,_Aet-i,et)}ti); 

8: end for 

9: set 

N 

i=i 

10 : set ^ A ; 

11; end for 

12; end for 
13; return 9t- 




jiiJ) 


(l- 7 t)Adi + 7 tSt-i(Czli ,Ct) 


As mentioned, the standard batch EM algorithm update, 
at iteration z -|- 1 , the parameters using 




t 


fi-i 


Xi^l) I Fo:i 

7=0 


and it can be established, under additional assumptions, 
that every fixed point of EM is indeed a stationary point of 
the likelihood. The online EM algorithm updates instead, 
at iteration (i.e., time step) t, the parameters on the basis 
of 


We test the algorithm on two different models, the linear 
Gaussian model in Example 1 and a stochastic volatility 
model. With these simulations we wish to show that 
the PaRIS-based algorithm is preferred to the EFBSm- 
based version. In the implementations we start updating 
the parameter first after a few observations have been 
processed in order to make sure that the filter estimates 
are stable. 

4-1 Linear Gaussian model 

For the linear Gaussian model in Example 1 we com¬ 
pare the parameter estimates produced by the PaRIS- 
based algorithm with those produced by the EFBSm- 
based algorithm of Del Moral et al. (2010). The observed 
data are generated by simulation under the parameters 

9 = (. 8 , .4^, .9^). We tune the number of particles of both 
algorithms and the number of backward draws TV in the 
PaRIS-based algorithm such that the computational time 
of both algorithms are similar. This implies 250 particles 
for the FEBSm-based algorithm and 1250 particles and 
TV = 5 for the PaRIS-based algorithm. We also restrict 
ourselves to estimation of the parameters a and cry. 

In Figure 1 we present output of the algorithms based on 

10 independent runs on the same simulated data, where 

9q = (. 1 , 2 ^,.9^) and where we update only the a and 
Gy parameters. We set 7 * = ® and start updating the 

parameters after 60 steps. As clear form the plot, both 
algorithms tend towards the true parameters. In addition, 
the PaRIS-based algorithm exhibits, as a consequence of 
the larger particle sample size, a lower variance. 
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Fig. 1. FFBSm-based (left boxes) and PaRIS-based esti- 
mates (right boxes) of a (panel a) and ay (panel b) 
in the linear Gaussian model. Dashed horizontal lines 
indicate true parameter values. 

4-2 Stochastic volatility model 


The stochastic volatility model is given by 

Xt+i = cl)Xt + aVt, 


Yt = aexp{Xt/2)Ut, 


ten, 


where again {VtjtgN and {DtjtgN are independent se¬ 
quences of mutually independent standard Gaussian noise 
variables. The parameters of the model are 9 = {(p, tr^, /3^), 
the sufficient statistics are given by 

(^r:t-t-i) 7 (^ru-t-i) 

St{xf.t+i) = s^ixf.t+i) = Vt+i exp{-xt+i), 

and parameter updates are given by 

A / s (Z2 

A{zi,Z2,Z3,Z4) = { —,Z3 -, Z4 

Zi 

We generate the data through simulation using the param¬ 
eters 9 = (.975, .16^, .63^). Again, we set the parameters 
of the algorithm in such a way that the computational 
times of the two algorithms are similar. This implies 110 
and 500 particles for the FFBSm-based and PaRIS-based 
algorithms, respectively. In addition, the latter used iV = 4 
backward draws. 


In Figure 2 we present the output of both algorithms from 
20 independent runs using the same data input for each 
run. We initialize the algorithms with 9q = (.5, .8^,1^), 


use 7t = and start the parameter updating after 

60 observations. We notice, as for the previous model, 
that both algorithms seem to converge towards the correct 
parameters and that the FFBSm-based algorithm exhibits 
the higher variance. 

Finally, to show that the algorithm indeed converges we 
perform one run of the algorithms on the stochastic volatil¬ 
ity model parameterized hy 9 = (.8, .1,1) using 9q = 
(.1,.1^,2^) and as many as T = 2,500,000 observations. 
For the FFBSm-based algorithm we used = 125 parti¬ 
cles and for the PaRIS-based algorithm we used N = 500 
and N = 2. Both algorithms use = t~'^ and do not 
update the parameters for the first 60 observations. The re¬ 
sults are reported in Figure 3, which indicates convergence 
for both algorithms. Taking the mean of the last 1000 pa¬ 
rameter estimates yields the estimates (.802, .093,1.01)and 
(.807, .084,1.03) for the PaRIS-based and FFBSm-based 
algorithms, respectively. 

5. GONGLUSIONS 

We have presented a new particle-based version of the 
online EM algorithm for parameter estimation in general 
SSMs. This new algorithm, which can be viewed as a 
hybrid of the PaRIS smoother proposed by Olsson and 
Westerborn (2014) and the online EM algorithm of Gappe 
(2011), has a computational complexity that grows only 
linearly with the number of particles, which results in a fast 
algorithm. Gompared to existing algorithms, this allows 
us to use considerably more particles and, consequently, 
produce considerably more accurate estimates for same 
amount of computational work. 
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